36 research outputs found
Brief Announcement: The Fault-Tolerant Cluster-Sending Problem
The development of fault-tolerant distributed systems that can tolerate Byzantine behavior has traditionally focused on consensus protocols, which support fully-replicated designs. The development of more sophisticated high-performance Byzantine distributed systems, however, requires more specialized fault-tolerant communication primitives.
In this brief announcement, we identify the cluster-sending problem - the problem of reliably sending a message from one Byzantine cluster to another Byzantine cluster - as such an essential communication primitive. We not only formalize this fundamental problem, but also establish lower bounds on its complexity under crash failures and Byzantine failures. Furthermore, we develop practical cluster-sending protocols that meet these lower bounds and, hence, have optimal complexity. As such, our work provides a strong foundation for the further exploration of novel designs that address challenges encountered in fault-tolerant distributed systems.
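The abstract does not spell out the protocols themselves, but the core receiving-side idea behind reliable cluster-sending can be illustrated with a minimal sketch. Assuming the sending cluster has at most f_sender faulty replicas, a receiver that sees f_sender + 1 identical copies of a message knows at least one copy came from a non-faulty sender. The function name `cluster_receive` and the dictionary-based message model are illustrative, not from the paper:

```python
from collections import Counter

def cluster_receive(copies, f_sender):
    """Accept a value once f_sender + 1 identical copies have arrived.

    `copies` maps sender-replica id -> the value that replica actually sent
    (Byzantine replicas may send anything). With at most f_sender faulty
    senders, f_sender + 1 matching copies guarantee that at least one copy
    came from a non-faulty replica, so the value is the one the sending
    cluster agreed on.
    """
    counts = Counter(copies.values())
    for value, count in counts.items():
        if count >= f_sender + 1:
            return value
    return None  # not enough matching copies: no value can be trusted yet
```

For example, with four senders of which at most one is Byzantine, three matching copies suffice even if the faulty replica sends a forged value. The optimal protocols in the paper additionally minimize *which* and *how many* replicas must send at all; this sketch only shows the acceptance rule.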
Coordination-Free Byzantine Replication with Minimal Communication Costs
State-of-the-art fault-tolerant and federated data management systems rely on fully-replicated designs in which all participants have equivalent roles. Consequently, these systems have only limited scalability and are ill-suited for high-performance data management. As an alternative, we propose a hierarchical design in which a Byzantine cluster manages data, while an arbitrary number of learners can reliably learn the resulting updates and use the corresponding data.
To realize our design, we propose the delayed-replication algorithm, an efficient solution to the Byzantine learner problem that is central to our design. The delayed-replication algorithm is coordination-free, scalable, and has minimal communication cost for all participants involved. As such, the delayed-replication algorithm opens the door to new high-performance fault-tolerant and federated data management systems. To illustrate this, we show that the delayed-replication algorithm is not only useful to support specialized learners, but can also be used to reduce the overall communication cost of permissioned blockchains and to improve their storage scalability.
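One way to see how learning can have minimal communication cost is that the cluster's replicas need not each send the full decided log: each replica can transfer a distinct slice, and the learner checks the reassembled log against a digest the cluster certified. This is a hedged sketch of that general idea, not the paper's actual delayed-replication algorithm; the names `split_log` and `learn` and the SHA-256 digest scheme are assumptions made for illustration:

```python
import hashlib

def split_log(log, n):
    """Partition the decided log into n chunks, one per cluster replica,
    so each replica transfers only ~1/n of the data."""
    k = (len(log) + n - 1) // n
    return [log[i * k:(i + 1) * k] for i in range(n)]

def learn(chunks, certified_digest):
    """Reassemble the chunks received from the replicas and verify them
    against a digest the cluster certified. A Byzantine replica that sends a
    wrong chunk is detected because the digest mismatches; the learner can
    then re-request that chunk from another replica."""
    log = [entry for chunk in chunks for entry in chunk]
    digest = hashlib.sha256("|".join(log).encode()).hexdigest()
    return log if digest == certified_digest else None
```

No coordination among replicas is needed to answer a learner: each replica serves its slice independently, which is what makes the approach scalable to an arbitrary number of learners.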
Brief Announcement: Revisiting Consensus Protocols through Wait-Free Parallelization
In this brief announcement, we propose a protocol-agnostic approach to improve the design of primary-backup consensus protocols. At the core of our approach is a novel wait-free design of running several instances of the underlying consensus protocol in parallel. To yield a high-performance parallelized design, we present coordination-free techniques to order operations across parallel instances, deal with instance failures, and assign clients to specific instances. Consequently, the design we present is able to reduce the load on individual instances and primaries, while also reducing the adverse effects of any malicious replicas. Our design is fine-tuned such that the instances coordinated by non-faulty replicas are wait-free: they can continuously make consensus decisions, independent of the behavior of any other instances.
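The simplest coordination-free way to order operations across parallel instances is positional: if there are m instances, the r-th decision of instance i can be placed at global position r * m + i, so each instance only needs to know its own index and m. This is a sketch of that ordering rule under those assumptions, not the paper's full technique (which must also handle failed or slow instances):

```python
def global_order(decisions_per_instance):
    """Merge the decision streams of m parallel consensus instances into a
    single global order without cross-instance coordination: round r of
    instance i lands at global position r * m + i."""
    m = len(decisions_per_instance)
    rounds = min(len(stream) for stream in decisions_per_instance)
    ordered = []
    for r in range(rounds):
        for i in range(m):
            ordered.append(decisions_per_instance[i][r])
    return ordered
```

Note that this rule makes the global order advance only as fast as the slowest instance's decided prefix; the wait-free property in the abstract concerns each instance's ability to keep *deciding* independently, not the merged order.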
Stab-Forests: Dynamic Data Structures for Efficient Temporal Query Processing
Many sources of data have temporal start and end attributes or are created in a time-ordered manner. Hence, it is only natural to consider joining datasets based on these temporal attributes. To do so efficiently, several internal-memory temporal join algorithms have recently been proposed. Unfortunately, these join algorithms are designed to join entire datasets and cannot efficiently join skewed datasets in which only a few events participate in the join result.
To support high-performance internal-memory temporal joins of skewed datasets, we propose the skip-join algorithm, which operates on stab-forests. The stab-forest is a novel dynamic data structure for indexing temporal data that allows efficient updates when events are appended in a time-based order. Our stab-forests efficiently support not only traditional temporal stab-queries, but also more general multi-stab-queries. We conducted an experimental evaluation to compare the skip-join algorithm with state-of-the-art techniques using real-world datasets. We observed that the skip-join algorithm outperforms other techniques by an order of magnitude when joining skewed datasets and delivers comparable performance to other techniques on non-skewed datasets.
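For readers unfamiliar with the terminology: a stab-query at time t asks for all events whose interval [start, end] contains t, and a multi-stab-query answers several such queries at once. The following minimal sketch shows the query semantics on an append-in-time-order event list; it is not the stab-forest data structure itself, which shares work across query times instead of scanning:

```python
import bisect

def stab(events, t):
    """Return all events (start, end) whose interval contains time t.
    Events are kept sorted by start time, matching the append-in-time-order
    setting; bisect skips every event that starts after t."""
    i = bisect.bisect_right(events, (t, float("inf")))
    return [(s, e) for (s, e) in events[:i] if e >= t]

def multi_stab(events, times):
    """A multi-stab-query answers several stab-queries at once. A real
    stab-forest amortizes work across the query times; this sketch does not."""
    return {t: stab(events, t) for t in times}
```

For example, `stab([(1, 4), (2, 9), (5, 7), (8, 8)], 5)` returns `[(2, 9), (5, 7)]`, the two events whose intervals contain time 5.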
Practical View-Change-Less Protocol through Rapid View Synchronization
The emergence of blockchain technology has renewed interest in
consensus-based data management systems that are resilient to failures. To
maximize throughput of these systems, we have recently seen several prototype
consensus solutions that optimize for throughput at the expense of increased
implementation complexity, higher costs, and reduced reliability. Due to this, it remains
unclear how these prototypes will perform in real-world environments. In this
paper, we present the Practical View-Change-Less Protocol PVP, a
high-throughput, simple, and reliable consensus protocol. Central to PVP is the
combination of (1) a chained consensus design for replicating requests with a
reduced message cost; (2) the novel Rapid View Synchronization protocol that
enables robust and low-cost failure recovery; and (3) a high-performance
concurrent consensus architecture in which independent instances of the chained
consensus operate concurrently to process requests with high throughput and
without single-replica bottlenecks. Due to the concurrent consensus
architecture, PVP greatly outperforms traditional primary-backup consensus
protocols such as PBFT (by up to 430%), Narwhal (by up to 296%), and HotStuff
(by up to 3803%). Due to its reduced message cost, PVP is even able to
outperform RCC, a state-of-the-art high-throughput concurrent consensus
protocol, by up to 23%. Furthermore, PVP is able to maintain a stable and low
latency and consistently high throughput even during failures.
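The abstract gives no protocol details beyond the names of PVP's components, but the general principle behind any chained consensus design can be sketched: each proposal embeds the hash of its parent, so a single vote on a block implicitly re-affirms its whole ancestry, which is what reduces message cost relative to deciding every request separately. The block layout below is a generic illustration of chaining, not PVP's actual message format:

```python
import hashlib

def block_hash(block):
    """Deterministic digest of a block (dicts preserve insertion order)."""
    return hashlib.sha256(repr(block).encode()).hexdigest()

def make_block(parent, payload):
    """Chain a new proposal onto its parent by embedding the parent's hash.
    Voting for this block then transitively endorses every ancestor, so one
    round of votes can advance the whole chain."""
    return {"parent": block_hash(parent) if parent else None,
            "payload": payload}
```

Tampering with any ancestor changes its hash and breaks the chain, which is also what makes failure recovery (the job of view synchronization) safe: replicas can agree on the highest intact chain prefix.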
I/O efficient bisimulation partitioning on very large directed acyclic graphs
In this paper we introduce the first efficient external-memory algorithm to
compute the bisimilarity equivalence classes of a directed acyclic graph (DAG).
DAGs are commonly used to model data in a wide variety of practical
applications, ranging from XML documents and data provenance models, to web
taxonomies and scientific workflows. In the study of efficient reasoning over
massive graphs, the notion of node bisimilarity plays a central role. For
example, grouping together bisimilar nodes in an XML data set is the first step
in many sophisticated approaches to building indexing data structures for
efficient XPath query evaluation. To date, however, only internal-memory
bisimulation algorithms have been investigated. As the size of real-world DAG
data sets often exceeds available main memory, storage in external memory
becomes necessary. Hence, there is a practical need for an efficient approach
to computing bisimulation in external memory.
Our general algorithm has a worst-case I/O-complexity of O(Sort(|N| + |E|)),
where |N| and |E| are, respectively, the numbers of nodes and edges in the data graph
and Sort(n) is the number of accesses to external memory needed to sort an
input of size n. We also study specializations of this algorithm to common
variations of bisimulation for tree-structured XML data sets. We empirically
verify efficient performance of the algorithms on graphs and XML documents
having billions of nodes and edges, and find that the algorithms can process
such graphs efficiently even when very limited internal memory is available.
The proposed algorithms are simple enough for practical implementation and use,
and open the door for further study of external-memory bisimulation algorithms.
To this end, the full open-source C++ implementation has been made freely
available.
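To make the computed notion concrete: on a DAG, two nodes are bisimilar exactly when they carry the same label and their children fall into the same set of bisimulation classes, which can be computed bottom-up. The following internal-memory sketch shows that definition directly (the paper's contribution is performing this computation with O(Sort(|N| + |E|)) I/Os in external memory, which this sketch does not attempt); note it uses recursion and so is only suitable for small inputs:

```python
def bisimulation_classes(labels, children):
    """Assign each DAG node its bisimulation equivalence class, bottom-up:
    a node's signature is its label plus the set of its children's classes,
    and identical signatures map to the same class id."""
    classes, table = {}, {}

    def cls(v):
        if v not in classes:
            sig = (labels[v], frozenset(cls(c) for c in children.get(v, [])))
            # setdefault hands out a fresh id (current table size) per signature
            classes[v] = table.setdefault(sig, len(table))
        return classes[v]

    for v in labels:
        cls(v)
    return classes
```

On a diamond-shaped DAG where two middle nodes have the same label and the same child, both land in the same class, which is the grouping step that, e.g., XPath indexing structures build on.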